-
Notifications
You must be signed in to change notification settings - Fork 3.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Estimate partition memory usage based on previous attempts #11857
Conversation
@arhimondr do we need RN for this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM % comments
.../trino-main/src/main/java/io/trino/execution/scheduler/ConstantPartitionMemoryEstimator.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/execution/scheduler/FaultTolerantStageScheduler.java
Outdated
Show resolved
Hide resolved
...in/src/main/java/io/trino/execution/scheduler/ExponentialGrowthPartitionMemoryEstimator.java
Outdated
Show resolved
Hide resolved
bd5e8b0
to
835f201
Compare
Rebased on top of #11861 |
835f201
to
f7ea0b7
Compare
AC |
package io.trino.execution.scheduler; | ||
|
||
@FunctionalInterface | ||
public interface PartitionMemoryEstimatorFactory |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
QQ: why having a factory is beneficial here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We want a new instance for each stage, so we can do estimations based on different tasks which completed for this stage.
core/trino-main/src/main/java/io/trino/execution/scheduler/FaultTolerantStageScheduler.java
Outdated
Show resolved
Hide resolved
...in/src/main/java/io/trino/execution/scheduler/ExponentialGrowthPartitionMemoryEstimator.java
Outdated
Show resolved
Hide resolved
...in/src/main/java/io/trino/execution/scheduler/ExponentialGrowthPartitionMemoryEstimator.java
Outdated
Show resolved
Hide resolved
// take previousRequiredBytes into account when registering failure on oom. It is conservative hence safer (and in-line with getNextRetryMemoryRequirements) | ||
long previousRequiredBytes = previousMemoryRequirements.getRequiredMemory().toBytes(); | ||
long previousPeakBytes = peakMemoryUsage.toBytes(); | ||
memoryUsageDistribution.add(Math.max(previousRequiredBytes, previousPeakBytes) * growthFactor); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would we add estimated memory usage to the distribution? If retry succeeds, then the actual usage will be added, right? This seems to skew the metric.
I understand your intention though. If one task consumes large amount of memory, then other tasks may also need large amount of memory. But this will make the stats collection inaccurate, maybe we should explore some other approach instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah - this surely is not exact science and I am not sure how well it will work in practice. But intention is exactly what you wrote. If we see that tasks are dying because we gave them too little memory we want to bump initial memory already for new tasks. Not wait until we have one which succeeds (it may take a lot of time till we have one).
I was thinking first about having two separate histograms for successful and unsuccessful tries. And make the one for unsuccessful decaying over time so "newer data is more important" - but I did not come up with reasonable way to merge the data from both, so I implemented the simple (yet I agree not 100% bullet-proof) approach.
Happy to hear suggestions how to improve though :)
BTW: I will add a commit on top with extra debug logging so we can see how it works in practice when testing out queries on cluster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm. Actually I guess with some tuning your approach might work well in practice. We can leave it as it is for now.
f7ea0b7
to
6ea2208
Compare
6ea2208
to
57457cf
Compare
Description
Estimate partition memory usage based on previous attempts.
This applies for execution with task-level retries when bin-packing node allocator is selected
improvement
core engine
Documentation
(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.
Release notes
(x) No release notes entries required.
( ) Release notes entries required with the following suggested text: